
Algorithms for Binary Neural Networks

TABLE 3.4
With different λ and θ, we evaluate the accuracies of BONNs based on WRN-22 and WRN-40 on CIFAR-10/100. When varying λ, the Bayesian feature loss is not used (θ = 0). When varying θ, we choose the optimal loss weight (λ = 1e4) for the Bayesian kernel loss.

Hyper-param.      WRN-22 (BONN)          WRN-40 (BONN)
                CIFAR-10  CIFAR-100    CIFAR-10  CIFAR-100
λ    1e3         85.82     59.32        85.79     58.84
     1e4         86.23     59.77        87.12     60.32
     1e5         85.74     57.73        86.22     59.93
     0           84.97     55.38        84.61     56.03
θ    1e2         87.34     60.31        87.23     60.83
     1e3         86.49     60.37        87.18     61.25
     1e4         86.27     60.91        87.41     61.03
     0           86.23     59.77        87.12     60.32

3.7.7 Ablation Study

Hyper-Parameter Selection In this section, we evaluate the effects of hyperparameters

on BONN performance, including λ and θ. The Bayesian kernel loss and the Bayesian

feature loss are balanced by λ and θ, respectively, to adjust the distributions of kernels and

features in a better form. WRN-22 and WRN-40 are used. The implementation details are

given below.
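Although the individual losses are defined earlier in the chapter, the way λ and θ enter the training objective can be sketched as a simple weighted sum. The function below is illustrative only: the loss values are passed in as plain numbers, and the term names follow the text rather than any exact implementation.

```python
def total_loss(ce_loss, kernel_loss, feature_loss, lam=1e4, theta=1e3):
    """Combine the three terms as described in the text: cross-entropy
    plus the Bayesian kernel loss weighted by lambda and the Bayesian
    feature loss weighted by theta. The exact loss definitions are
    given elsewhere in the chapter; this only shows the balancing."""
    return ce_loss + lam * kernel_loss + theta * feature_loss
```

With the optimal weights reported in the text (λ = 1e4, θ = 1e3), a kernel loss of 1e-5 and a feature loss of 1e-4 each contribute 0.1 to the objective, comparable in scale to a typical cross-entropy value.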

As shown in Table 3.4, we first vary λ with θ set to zero to validate the influence of the Bayesian kernel loss on the kernel distribution. Using the Bayesian kernel loss consistently improves the accuracy on CIFAR-10. However, the accuracy does not increase monotonically with λ, indicating that what is needed is not a larger λ but a properly chosen one that balances the cross-entropy loss against the Bayesian kernel loss. For example, when λ is set to 1e4, we obtain the best balance and the highest classification accuracy.
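Selecting the loss weight from such an ablation amounts to a small grid search over candidate values. The snippet below is a hypothetical illustration using the WRN-22 CIFAR-10 numbers from Table 3.4:

```python
# Accuracy for each candidate lambda (WRN-22, CIFAR-10, from Table 3.4).
acc_by_lambda = {0: 84.97, 1e3: 85.82, 1e4: 86.23, 1e5: 85.74}

# Pick the weight with the highest validation accuracy.
best_lambda = max(acc_by_lambda, key=acc_by_lambda.get)
print(best_lambda)  # 10000.0
```

The argmax recovers λ = 1e4, the value the text identifies as the optimal balance.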

The hyperparameter θ dominates the intraclass variation of the features, and we likewise investigate the effect of the Bayesian feature loss by varying θ. The results show that the classification accuracy varies with θ much as it does with λ, verifying that the Bayesian feature loss leads to better classification accuracy when a proper θ is chosen.
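The exact Bayesian feature loss is defined earlier in the chapter. As a rough, hypothetical stand-in, a center-loss-style penalty illustrates what "dominating intraclass variation" means in code: the mean squared distance of each feature to its class center.

```python
import numpy as np

def intraclass_variation(features, labels):
    """Center-loss-style sketch of a feature loss that penalizes
    intraclass variation. This is an illustrative stand-in, NOT the
    chapter's exact Bayesian feature loss.
    features: (N, D) array; labels: (N,) integer array."""
    loss = 0.0
    for c in np.unique(labels):
        fc = features[labels == c]   # features belonging to class c
        center = fc.mean(axis=0)     # class center
        loss += ((fc - center) ** 2).sum()
    return loss / len(features)
```

Weighting such a term by θ pulls same-class features toward their centers, which matches the text's description of θ controlling intraclass variation.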

We also compare the convergence of our method with that of its counterparts using ResNet-18 on ImageNet ILSVRC12. As plotted in Fig. 3.22, the XNOR-Net training curve oscillates vigorously, which we suspect is caused by a suboptimal learning process. In contrast, our BONN achieves both better training and better test accuracy.

Effectiveness of Bayesian Binarization on ImageNet ILSVRC12 To better understand the Bayesian losses, we examine how each loss affects performance on the large-scale ImageNet ILSVRC12 dataset. Based on the experiments described earlier, we set λ to 1e4 and θ to 1e3 whenever the corresponding loss is used. As shown in Table 3.5, both the Bayesian kernel loss and the Bayesian feature loss can independently improve the accuracy on ImageNet; when applied together, the Top-1 accuracy reaches its highest value of 59.3%. In Fig. 3.21, we visualize the feature maps of the ResNet-18 model on the ImageNet dataset. They indicate that our method extracts the essential features needed for accurate classification.

TABLE 3.5
Effect of Bayesian losses on the ImageNet dataset. The backbone is ResNet-18.

Bayesian kernel loss     ✗      ✓      ✗      ✓
Bayesian feature loss    ✗      ✗      ✓      ✓
Accuracy  Top-1        56.3   58.3   58.4   59.3
          Top-5        79.8   80.8   80.8   81.6